An Application of Operational Research to Computational Linguistics: Word Ambiguity
Authors
Abstract
This paper draws on graph theory and optimization techniques to develop a new measure of word ambiguity (e.g., homonymy and polysemy) for use in psycholinguistic research. The measure quantifies the uncertainty about the intended meaning of an English word. Specifically, data on sixty-four thousand distinct words were collected from a corpus of close to three hundred million words. These data are used to generate word-association information, which forms the basis for semantic graphs from which clusters are created and analyzed. The clusters identify groups of words related to the different meanings of a word and are used to calculate a set of relative probabilities for those meanings. These probabilities are in turn used to calculate the information entropy for the word, which acts as a surrogate measure of ambiguity. A genetic algorithm is used to determine optimal parameters for our word-association formula and for the graph clustering algorithm. The effectiveness of this application is demonstrated with examples from psycholinguistic research.
Keywords: computational linguistics, word ambiguity, graph clustering, genetic algorithm
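The final step described in the abstract, turning relative meaning probabilities into an entropy score, can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulas, so this assumes plain Shannon entropy in bits and assumes each meaning's probability is proportional to a weight attached to its cluster. The function name `sense_entropy` and the example weights are hypothetical.

```python
import math


def sense_entropy(cluster_weights):
    """Shannon entropy (in bits) of a word's meaning distribution.

    Assumption: each cluster stands for one meaning, and its relative
    probability is its weight divided by the total weight. The paper's
    own weighting scheme is not stated in the abstract.
    """
    total = sum(cluster_weights)
    if total <= 0:
        raise ValueError("cluster weights must sum to a positive value")
    # Zero-weight clusters contribute nothing to the entropy.
    probs = [w / total for w in cluster_weights if w > 0]
    return sum(p * math.log2(1.0 / p) for p in probs)


# Hypothetical weights: one dominant meaning versus two balanced meanings.
unambiguous = sense_entropy([10])
skewed = sense_entropy([9, 1])
balanced = sense_entropy([5, 5])
```

Under these assumptions, a word with a single cluster scores zero, and the score grows as the meanings become more numerous and more evenly weighted, which is the behavior one would want from a surrogate for ambiguity.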
Similar resources
AXEL: a framework to deal with ambiguity in three-noun compounds
Cognitive Linguistics has been widely used to deal with the ambiguity generated by words in combination. Although this domain offers many solutions to address this challenge, not all of them can be implemented in a computational environment. The Dynamic Construal of Meaning framework is argued to have this ability because it describes an intrinsic degree of association of meanings, which in tur...
A Statistically Emergent Approach for Language Processing: Application to Modeling Context Effects in Ambiguous Chinese Word Boundary Perception
This paper proposes that the process of language understanding can be modeled as a collective phenomenon that emerges from a myriad of microscopic and diverse activities. The process is analogous to the crystallization process in chemistry. The essential features of this model are: asynchronous parallelism; temperature-controlled randomness; and statistically emergent active symbols. A computer...
Producing a Persian Text Tokenizer Corpus Focusing on Its Computational Linguistics Considerations
The main task of tokenization is to divide the sentences of a text into their constituent units and remove punctuation marks (dots, commas, etc.). Each unit is a continuous lexical or grammatical written chain that forms an independent semantic unit. Tokenization occurs at the word level, and the extracted units can be used as input to other components such as a stemmer. The requirement to create...
Learning Morpho-Lexical Probabilities from an Untagged Corpus with an Application to Hebrew
This paper proposes a new approach for acquiring morpho-lexical probabilities from an untagged corpus. This approach demonstrates a way to extract very useful and nontrivial information from an untagged corpus, which otherwise would require laborious tagging of large corpora. The paper describes the use of these morpho-lexical probabilities as an information source for morphological disambiguat...
Studying impressive parameters on the performance of Persian probabilistic context free grammar parser
In linguistics, a tree bank is a parsed text corpus that annotates syntactic or semantic sentence structure. The exploitation of tree bank data has been important ever since the first large-scale tree bank, The Penn Treebank, was published. However, although originating in computational linguistics, the value of tree bank is becoming more widely appreciated in linguistics research as a whole. F...
Journal: INFOR
Volume: 48
Publication date: 2010